Abstract

Multimedia uploaded content is tagged and recom- 
mended by users of collaborative systems, resulting 
in informal classifications also known as folksonomies. 
Faceted web ranking has been proved a reasonable 
alternative to a single ranking which does not take 
into account a personalized context. In this paper we 
analyze the online computation of rankings of users 
associated to facets made up of multiple tags. Possi- 
ble applications are user reputation evaluation (ego- 
ranking) and improvement of content quality in case 
of retrieval. We propose a solution based on PageR- 
ank as centrality measure: (i) a ranking for each tag 
is computed offline on the basis of the correspond- 
ing tag-dependent subgraph; (ii) a faceted order is 
generated by merging rankings corresponding to all 
the tags in the facet. The fundamental assumption, 
validated by empirical observations, is that step (i) is 
scalable. We also present algorithms for part (ii) having
time complexity O(k), where k is the number of
tags in the facet, well suited to online computation. 



1 Introduction 

In collaborative tagging systems, users assign key- 
words or tags to their uploaded content, or book- 
marks, in order to improve future navigation, filtering
or searching (see, e.g., Marlow et al. [MNBD06]).
These systems generate a categorization of content 
commonly known as a folksonomy. 

An example is the collaborative URL tagging system
Delicious [Del], which was analyzed in depth by Golder
and Huberman [GH06], discovering temporal stability in
the relative proportions of tags within a given tagging
subject. In this system Internet resources (URLs) are
bookmarked and classified with tags by users. Two other
well-known collaborative tagging systems for multimedia
content are YouTube [You] (videos) and Flickr [Fli]
(photos), which are the focus of this paper.

YouTube and Flickr differ from Delicious in that 
the resources are uploaded by users, so that all con- 
tents bookmarked as favorites are inside the system. 
That is, YouTube and Flickr can be considered closed 
systems. 

Users can be ranked in relation to a tag or set of 
tags which we call a facet. Some applications of these 
faceted (i.e., tag-associated) rankings are: (i) search- 
ing for content through navigation of the best users 
inside a tag-facet; (ii) measuring reputation of users
by listing their best rankings for different tags or tag 
sets. 

The order or ranking can be determined by a centrality
measure, such as PageRank [PBMW98, LM03], in a
recommendation or subscription graph. Given a facet, a
straightforward solution is to compute the centrality
measure based on an appropriate facet-dependent subgraph
of the recommendation network. However, the online
computation of the centrality measure is infeasible
because of its high time complexity, even for small facets
with two or three tags. Moreover, the offline computation
of the centrality measure for each facet is also infeasible
because of the large number of possible facets. Therefore,
alternative solutions must be sought. A simple solution is
to use a general ranking computed offline, which is then
filtered online for each facet query. Using a single ranking
of web pages or users within folksonomies has the
disadvantage that the best ranked ones are those having the
highest centrality in a global ranking, which is
facet-independent. In the information retrieval case, this
implies that the returned results are ordered in a way that
does not take into account the focus on the searched topic.
This problem is called topic drift [RD02].

In this paper we propose a solution to the problem 
of topic drift in faceted rankings which is based on 
PageRank as centrality measure. Our approach fol- 
lows a two-step procedure: (i) a ranking for each tag 
is computed offline on the basis of the corresponding 
tag-dependent subgraph; (ii) a faceted order is gen- 
erated by merging rankings corresponding to all the 
tags in the facet. 

The fundamental assumption is that step (i) in 
this procedure can be computed with an acceptable 
overhead which depends on the size of the dataset. 
This hypothesis is validated by two empirical obser- 
vations. On one hand, in the studied recommenda- 
tion (tagged) graphs most of the tags are associated 
to very small subgraphs, while only a small number of 
tags have large associated subgraphs (see Section [3]). 
On the other hand, the mean number of tags per edge 
is finite and small, as explained in Section 4.8.

The problem then becomes to find a good and efficient
algorithm to merge several rankings in step (ii). In
Section 4 we present several alternatives. We concentrate
our effort on facets that correspond to the logical
conjunction of tags (match-all-tags queries) because this
is the most used logical combination in information
retrieval (Christopher [Chr08], Chapter 1).

The rest of the paper is organized as follows. We discuss
prior works and their limitations in Section 2. In
Section 3 we explore two real examples of tagged graphs.
In particular, we analyze several important characteristics
of these graphs, such as the scale-free behavior of the
vertex indegree and assortativeness of the embedded
recommendation network (see Section 3.3). The proposed
algorithms are introduced in Section 4, including an
analysis of related scalability issues in Section 4.8. We
discuss experimental results in Section 5 and we conclude
with some final remarks and possible directions of future
work in Section 6.

2 Related work 

Theory and implementation concepts used in this work for
PageRank centrality are based on the comprehensive survey
of Langville and Meyer [LM03]. This centrality measure for
directed graphs is a variation of eigenvector centrality
which includes the notion of a random surfer, i.e., an
imaginary surfer that, on arriving at a vertex with no
out-links, jumps to a randomly chosen vertex. The PageRank
algorithm is based on the iterated multiplication of the
adjacency matrix of the directed graph (modified to add the
random surfer) and a vector representing the probability
that a surfer is at a particular vertex. The iteration
stops when no vector component changes by more than a
given error e. Only about a hundred
matrix multiplications are needed for e = 10^^ and 
standard parameters (see [LM03] for details).
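
The iteration described above can be sketched in a few lines of Python. This is only an illustrative power iteration, not the implementation used in the paper: the function name `pagerank`, the damping factor 0.85, the tolerance and the iteration cap are assumptions chosen for the example, and the graph is assumed to be simple (no repeated edges).

```python
def pagerank(edges, damping=0.85, tol=1e-8, max_iter=200):
    """Power-iteration PageRank over a list of directed (source, target) edges.

    damping, tol and max_iter are illustrative defaults (assumptions).
    """
    nodes = sorted({v for edge in edges for v in edge})
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Probability mass sitting on vertices with no out-links is
        # redistributed uniformly (the random surfer jumps anywhere).
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        # Stop when no component changed by more than the tolerance.
        converged = max(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank
```

Each pass touches every edge once, which is why the cost of one iteration is linear in the number of edges.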

Basic topic-sensitive PageRank analysis was attempted by
biasing the general PageRank equation to special subsets
of web pages by Al-Saffar and Heileman [ASH07], and by
using a predefined set of categories extracted from the
Open Directory Project [ODP] by Haveliwala [Hav02].
Although encouraging results were obtained in both works,
they suffer from the limitation of a fixed number of topics
biasing the rankings. Other variations of personalized
PageRank were augmented with weights based on usage by
Eirinaki and Vazirgiannis [EV05] and on access time-length
and frequency by previous users by Guo et al. [GRP07].
They built a unique PageRank vector adapted to usage, but
the result is neither user dependent nor query dependent,
as we would prefer.

Hotho et al. [HJSS06] adapted PageRank to work
on a tripartite graph of users, tags and resources cor- 
responding to a folksonomy. They also developed 
a form of topic-biasing on the modified PageRank, 
but the generation of a faceted ranking implies a new 
computation of the adapted PageRank algorithm on 
the network for each new facet. 

There has also been some work done on faceted ranking of
web pages. For example, the approach of DeLong, Mane and
Srivastava [DMS06] involves the construction of a larger
multigraph using the hyperlink graph with each vertex
corresponding to a pair webpage-concept and each edge to a
hyperlink associated with a concept. Subgraph ideas are
suggested by them: "It might be faster to simply run
PageRank on sub-graphs pertaining to each individual
concept (assuming there are a small number of concepts)."
Although DeLong et al. [DMS06] obtain good ranking results
for single-keyword facets, they do not support
multi-keyword queries.

Query-dependent PageRank calculation was introduced in
Richardson and Domingos [RD02] to extract a weighted
probability per keyword for each webpage. These
probabilities are summed up to generate a query-dependent
result. They also show that this faceted ranking has, for
thousands of keywords, computation and storage
requirements that are only approximately 100-200 times
greater than that of a single query-independent PageRank.
As we show in Section 4.8, our facet-dependent ranking
algorithms have similar time complexity.

Scalability issues were also tackled by Jeh and Widom
[JW02], who criticize offline computation of multiple
PageRank vectors for each possible query and prefer a more
efficient dynamic programming algorithm for online
calculation of the faceted rankings based on offline
computation of basis vectors. They found that their
algorithm scales well with the size of set H, the biasing
page set, and they criticize previous ideas in [RD02]:
"[Richardson and Domingos] suggested that importance
scores be precomputed offline for every possible text
query, but the enormous number of possibilities makes this
approach difficult to scale."

In this paper, we propose a different alternative to 
the problem of faceted ranking. Instead of comput- 
ing offline the rankings corresponding to all possible 
facets, our solution requires only the offline compu- 
tation of a ranking per tag. A faceted ranking is 
generated by adequately merging the rankings of the 
corresponding tags. Section [4] deals with different ap- 
proaches to the merging step. 



3 Construction of a tagged graph

In this section we introduce the basic definitions re- 
lated to tagged graphs and we present the network 
analysis of two real cases. 

3.1 Basic definitions 

Let $G = (N, E, T)$ be a simple^1 directed graph with tags
on the edges, a tagged graph. $N$ is the set of vertices
$\{u_1, \ldots, u_n\}$, $E$ is the set of edges and $T(e)$
is the set of tags $\{t_1, \ldots, t_k\}$ associated with
edge $e$ in $E$. If $e \notin E$ then $T(e) := \emptyset$.
We shall call a certain set of tags
$F \subseteq \bigcup_{e \in E} T(e)$ a facet.

Let $M = \{(u_1, m_1, T_1), \ldots, (u_r, m_r, T_r)\}$ be a
set of tagged contents, where $u_i$ is the user, $m_i$ is
the content and $T_i$ is the preferred set of tags included
by the user^2, and let
$V = \{(c'_1, m'_1), \ldots, (c'_p, m'_p)\}$ be the set of
favorite recommendations, where $c'_j$ is a recommender
user and $m'_j$ is the recommended content^3. Then a tagged
graph $G = (N, E, T)$ is built, where

\[ N := \{u_i : \exists\, (u_i, m_i, T_i) \in M\} \cup \{c'_j : \exists\, (c'_j, m'_j) \in V\}, \]

\[ E := \{(c'_j, u_k) : (c'_j, m'_j) \in V \wedge (u_k, m'_j, T_k) \in M\}, \]

and

\[ T((c'_j, u_k)) := \bigcup \{T_k : (c'_j, m'_j) \in V \wedge (u_k, m'_j, T_k) \in M \wedge (c'_j, u_k) \in E\}. \]

We show an example of the application of these definitions
in Figure 1.

Given $G = (N, E, T)$ and a tag $t$, then
$G(t) := (N', E', T')$ is a tagged subgraph, where
$E' = \{e : e \in E \wedge t \in T(e)\}$,
$N' = \{a, b : (a, b) \in E'\}$ and
$T' = \{t \in T(e') : e' \in E'\}$.
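
As an illustration of these definitions, the following Python sketch builds a tagged graph from a content set M and a recommendation set V, and extracts a tag-dependent subgraph G(t). The dictionary representation (edge to tag set) and the function names are assumptions made for this example; the data reproduce the small music example of Figure 1.

```python
def build_tagged_graph(contents, favorites):
    """Build {(recommender, uploader): set_of_tags} from M and V.

    contents:  iterable of (user, content, tags) triples (the set M)
    favorites: iterable of (recommender, content) pairs (the set V)
    """
    # Each content is unique, so it has exactly one uploader.
    owner = {m: (u, frozenset(tags)) for u, m, tags in contents}
    graph = {}
    for recommender, m in favorites:
        if m not in owner:
            continue  # recommendation of unknown content: no edge
        uploader, tags = owner[m]
        # Tags of all contents recommended along the same edge are merged.
        graph.setdefault((recommender, uploader), set()).update(tags)
    return graph


def tag_subgraph(graph, tag):
    """Edges of G(t): those whose tag set contains `tag`."""
    return {edge: tags for edge, tags in graph.items() if tag in tags}


def vertices(graph):
    """Vertex set N' induced by the edges of a tagged graph."""
    return {v for edge in graph for v in edge}


M0 = [("A", "song1", {"blues"}), ("B", "song2", {"blues", "jazz"}),
      ("C", "song3", {"blues"}), ("C", "song4", {"jazz"}),
      ("D", "song5", {"blues"}), ("D", "song6", {"rock"})]
V0 = [("A", "song2"), ("B", "song4"), ("B", "song5"),
      ("A", "song3"), ("A", "song4"), ("C", "song6")]

G = build_tagged_graph(M0, V0)
```

In this toy graph the edge from A to C carries both blues and jazz, because A recommends two different contents uploaded by C.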

^1 Not a multigraph.

^2 In the rest of the paper user and vertex will be used
interchangeably to mean a vertex in a tagged graph, as an
abstraction of the webpage where the user publishes his/her
content, and edges or links will be used when referring to
edges in a tagged graph built using favorite
recommendations.

^3 Each content is considered unique, i.e., different users
do not upload the same content.
M_0 = { (A, song1, {blues}),        V_0 = { (A, song2),
        (B, song2, {blues, jazz}),          (B, song4),
        (C, song3, {blues}),                (B, song5),
        (C, song4, {jazz}),                 (A, song3),
        (D, song5, {blues}),                (A, song4),
        (D, song6, {rock}) }                (C, song6) }

[Graph drawing; the surviving edge labels read "blues, jazz".]

Figure 1: Example of construction of a tagged graph from a
set of contents $M_0$ and a set of recommendations $V_0$.

If $G_1 = (N_1, E_1, T_1)$ and $G_2 = (N_2, E_2, T_2)$ are
tagged graphs then $G_1 \cap G_2 := (N', E', T')$ where
$E' = E_1 \cap E_2$, $N' = \{a, b : (a, b) \in E'\}$ and
$T'(e) := T_1(e) \cap T_2(e)$. Also
$G_1 \cup G_2 := (N', E', T')$ where $E' = E_1 \cup E_2$,
$N' = \{a, b : (a, b) \in E'\}$ and
$T'(e) := T_1(e) \cup T_2(e)$.

The conjunction and disjunction graphs can be defined as
$G(t_1 \wedge \ldots \wedge t_{k-1} \wedge t_k) := G(t_1 \wedge \ldots \wedge t_{k-1}) \cap G(t_k)$
(see Figure 6(c)) and
$G(t_1 \vee \ldots \vee t_{k-1} \vee t_k) := G(t_1 \vee \ldots \vee t_{k-1}) \cup G(t_k)$
(see Figure 6(d)).

The number of edges of a graph $G$ is denoted by $|E(G)|$
and the number of vertices in a graph is denoted by
$|N(G)|$.

3.2 Two real cases: YouTube and Flickr

In this section, we present two examples of collaborative
tagging systems where content is tagged and recommendations
are made. These systems actually rank content according to
the number of visits, recommendations or relevance of the
text accompanying the content. However, to our knowledge,
no use of graph-based faceted ranking is made.

The taxonomy of tagging systems in Marlow et al. [MNBD06]
allows us to classify YouTube [You] and Flickr [Fli] in the
following ways:

• regarding the tagging rights, both are self-tagging
systems;

• regarding the aggregation model, they are set systems;

• regarding the object-type, they are called non-textual
systems;

• regarding source of material, they are classified as
user-contributed;

• finally, regarding tagging support, while YouTube can be
classified as a suggested tagging system, Flickr must be
considered a blind tagging system.

In our first example the content is multimedia in the form
of favorite videos recommended by users. The information
was collected from the service YouTube [You] using the
public API, crawling 185852 edges and 51490 vertices in
Breadth-First Search (BFS) order starting from the popular
user jcl5m, who had videos included in the top twenty top
rated videos during April 2008. From this information and
following the definitions in Section 3.1, we constructed a
complete tagged graph G and several sample subgraphs such
as G(music ∨ funny), G(music), G(funny) and
G(music ∧ funny) (other subgraphs present a similar
behavior). Table 1 presents the number of vertices and
edges of each of these networks. We must note that
mandatory categorical tags such as Entertainment, Sports or
Music, always capitalized, were removed in order to include
only tags inserted by users.
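
The conjunction and disjunction subgraphs named above follow the intersection and union definitions of Section 3.1. A minimal Python sketch, assuming a tagged graph stored as a dictionary from edges to tag sets (a representation chosen here for illustration, not taken from the paper) and using the toy edges of the music example of Figure 1 rather than crawled data:

```python
def tag_subgraph(graph, tag):
    """G(t): the edges whose tag set contains `tag`."""
    return {e: tags for e, tags in graph.items() if tag in tags}


def intersect(g1, g2):
    """G1 ∩ G2: common edges, each with the common tags."""
    return {e: g1[e] & g2[e] for e in g1.keys() & g2.keys()}


def union(g1, g2):
    """G1 ∪ G2: all edges, each with the merged tags."""
    return {e: g1.get(e, set()) | g2.get(e, set())
            for e in g1.keys() | g2.keys()}


def conjunction(graph, facet):
    """G(t1 ∧ ... ∧ tk), folded left to right as in the definition."""
    result = tag_subgraph(graph, facet[0])
    for tag in facet[1:]:
        result = intersect(result, tag_subgraph(graph, tag))
    return result


def disjunction(graph, facet):
    """G(t1 ∨ ... ∨ tk), folded left to right as in the definition."""
    result = tag_subgraph(graph, facet[0])
    for tag in facet[1:]:
        result = union(result, tag_subgraph(graph, tag))
    return result


# Toy tagged graph (hypothetical stand-in for a crawled network).
toy = {("A", "B"): {"blues", "jazz"}, ("B", "C"): {"jazz"},
       ("B", "D"): {"blues"}, ("A", "C"): {"blues", "jazz"},
       ("C", "D"): {"rock"}}
```

On the toy graph, `conjunction(toy, ["blues", "rock"])` is empty, since no single edge carries both tags.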



Graph               vertices     edges
G                     51,490   185,852
G(music ∨ funny)      18,368    26,388
G(music)              12,849    10,273
G(funny)               8,734    13,392
G(music ∧ funny)       1,406     1,147

Table 1: Sizes of the video tagged graph and some of
its subgraphs.






In our second example the content consists of photos and
the recommendations are in the form of favorite photos^4.
The information was collected from the service Flickr [Fli]
using the public API, crawling 229709 edges and 35210
vertices in BFS order starting from the popular user
junku-newcleus. The complete tagged graph G and the sample
subgraphs G(blue ∨ flower), G(blue), G(flower) and
G(blue ∧ flower) were constructed. The number of vertices
and edges of these graphs are shown in Table 2.



Graph               vertices     edges
G                     35,210   229,709
G(blue ∨ flower)      12,921    20,105
G(blue)               10,241    11,703
G(flower)              7,032     9,566
G(blue ∧ flower)       1,551     1,164

Table 2: Sizes of the photo tagged graph and some
of its subgraphs.



3.3 Network analysis 

Graph analysis was made using the tool Network Workbench
[N06], except for the calculation of PageRank. Figures 2, 3
and 4 show vertex indegree distribution, vertex outdegree
distribution and correlation of indegree of in-neighbors
with indegree of vertices for the YouTube and Flickr
networks. All graph-analytical parameters, except those for
small subgraphs like G(music ∧ funny), were binned and
plotted in log-log curves. This is the reason why some
degree points appear below zero and one (x-axis), because
there exist vertices with either indegree or outdegree
equal to zero.

Vertex indegree, in both video and photo networks, is
characterized by a power-law distribution:
$P(k) \approx k^{-\gamma}$, where $2 < \gamma < 3$ (see
Figure 2). Random variables modelled by this type of
heavy-tailed distributions have a finite mean, but infinite
second and higher non-central moments. Furthermore, there
is a non-vanishing probability of finding a vertex with an
arbitrarily high indegree. Clearly, in any real-world
network, the total number of vertices is a natural
upper-bound to the greatest possible indegree. However,
experience with Internet related networks shows that the
power-law distribution of the indegree does not change
significantly as the network grows and, hence, the
probability of finding a vertex with an arbitrary degree
eventually becomes non-zero (for more details see, e.g.,
Pastor-Satorras and Vespignani [PSV04]).

Figure 2: Binned indegree distribution

Figure 3: Binned outdegree distribution

Figure 4: Binned correlation of indegree of in-neighbors
with indegree

Figure 5: Binned Vertex PageRank distribution for
YouTube (top) and Flickr (bottom)

^4 Only the first fifty favorite photos of each user were
retrieved.
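
The statement about moments can be checked with a short calculation. The following is a sketch under a continuous approximation of the degree distribution, with a lower cut-off $k_{\min} > 0$ introduced here only for the derivation:

```latex
\[
  \langle k^{n} \rangle \;\propto\; \int_{k_{\min}}^{\infty} k^{n}\, k^{-\gamma}\, dk
  \;=\; \int_{k_{\min}}^{\infty} k^{\,n-\gamma}\, dk ,
\]
\[
  \text{which converges if and only if } n - \gamma < -1,
  \text{ i.e. } n < \gamma - 1 .
\]
\[
  \text{For } 2 < \gamma < 3:\quad
  n = 1 < \gamma - 1 \ \text{(finite mean)}, \qquad
  n = 2 > \gamma - 1 \ \text{(divergent second moment)} .
\]
```

All moments of order $n \geq 2$ diverge for the same reason.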

Since recommendation lists are made by individual users,
vertex outdegree does not show the same kind of scale-free
behavior as vertex indegree. On the contrary, each user
recommends only 20 to 30 other users on average (see
Figure 3). Moreover, since vertex outdegree is mostly
controlled by human users, we do not expect its average to
change significantly as the network grows.

The correlation of indegree of in-neighbors with vertex
indegree (see Figure 4) indicates the existence of
assortative (positive slope) or disassortative (negative
slope) behavior. Assortativeness is commonly observed in
social networks, where people with many connections relate
to people who are also well connected. Disassortativeness
is more common in other kinds of networks, such as
information, technological and biological networks (see,
e.g., Newman [New02]). In the favorite videos network there
is no clear correlation (small or no slope), but in the
photo network there is a slight assortativeness, indicating
a biased preference of vertices with high indegree for
vertices with high indegree (see Figure 4).

We also computed the PageRank of the sample graphs,
removing dangling vertices with indegree 1 and outdegree 0,
because most of them correspond to vertices which have not
been expanded by the crawler (BFS), having the lowest
PageRank (a similar approach is taken in [PBMW98]).
Figure 5 shows that PageRank distributions are also
scale-free, i.e., they can be approximated by power law
distributions. Note that the power law exponents are very
similar for the complete tagged graph and subgraphs, on
each network.
4 Faceted Ranking on Tagged Graphs

Given a set M of tagged content, a set V of favorite
recommendations and a tag set or facet F, the faceted
ranking problem consists in finding the ranking of
users according to facet F.

In this section we present six different approaches to the
faceted ranking problem using tagged graphs. The first two
algorithms (E-intersection and E-union/N-intersection in
Sections 4.2 and 4.3 respectively) are not scalable for
online queries because their computation requires the
extraction of a subgraph, which might be very large in a
large network^5, and the calculation of the corresponding
PageRank vector. Moreover, the offline computation of those
rankings for each possible facet
$F \subseteq \bigcup_{e \in E} T(e)$ is also infeasible
because of the large number of such facets. However, they
serve as a basis of comparison for the other four online
algorithms because they are a good approximation to the
desired result.

We should note that the focus of this paper is on 
conjunction-based queries in which all words must be 
matched, as opposed to disjunction-based ones where 
the match of any word is sufficient. Conjunction-based
queries are the most common type of boolean
queries [Chr08].

Before presenting faceted ranking algorithms, we 
need some preliminary definitions related to vertex 
centrality which are given in the following section. 

4.1 Vertex Centrality 

Given a graph $G = (N, E)$, $C(G) : N \to \mathbb{R}$ is a
vertex centrality function and
$R(C(G)) : N \to \mathbb{N}$ is a vertex ranking function
that associates a complete order such that the vertex with
the highest centrality in $C(G)$ is assigned the number
one, the second highest the number two, and so on. PageRank
$C(G)$ is a vertex centrality function associating
probabilities according to a random surfer traversing the
graph $G$ [LM03]. The vertex ranking function
$\mathcal{R}(G) := R(C(G))$ will be our default ranking for
graphs.

^5 We have observed that as the network grows the relative
frequency of tag usage converges. Similar behavior was
observed for particular resources by [GH06].



Figure 6: Example subgraphs of a tagged graph:
(a) G(blues); (b) G(jazz); (c) G(blues ∧ jazz);
(d) G(blues ∨ jazz).



4.2 E-intersection

Given a set of tags, a ranking may be calculated by
computing the centrality measure of the subgraph
corresponding to the recommendation edges which include all
the tags. This approach, called E-intersection, cannot be
implemented for online queries, as explained above, but
serves as a reasonable standard of comparison because we
use the exact information available for the PageRank in a
conjunctive query.

The E-intersection ranking for tagged graph G according to
facet $F = \{t_1, \ldots, t_k\}$ is

\[ \mathcal{R}(G(t_1 \wedge \ldots \wedge t_k)). \]



As an example see Figure 6(c). Assuming a previously built
inverted index (Christopher [Chr08], Chapter 1) for the
tagged graph mapping tags into sets of edges, the
complexity can be decomposed into the retrieval time for
each subgraph, which is proportional to
$\sum_{i=1}^{k} |E(G(t_i))|$, and the time of the PageRank
and sort algorithms, which take $\Theta(m \log m)$, where
$m = |E(G(t_1 \wedge \ldots \wedge t_k))|$. Then, the total
time complexity for this algorithm is
$O(k \times |E(G(t_i))|_{max} + m \log m)$, where
$|E(G(t_i))|_{max}$ is computed for the largest subgraph
$G(t_i)$.

4.3 E-union/N-intersection

Consider the example given in Figure 1 under the query
blues ∧ rock. According to the E-intersection algorithm,
there is no node in the network satisfying the query.
However, it may seem reasonable to return node D as a
response to such a search. In order to take this case into
account, we devised another algorithm called
E-union/N-intersection. In this case, the union of all edge
recommendations per tag is used when computing the
PageRank, but only those vertices involved in
recommendations for all tags are kept. The latter filtering
is included because we want vertices recommended for each
of the tags in the facet.

The E-union/N-intersection ranking for vertex n in a tagged
graph G according to facet $F = \{t_1, \ldots, t_k\}$ is

\[ R(C(G(t_1 \vee \ldots \vee t_k)))(n), \]

where $C$ is restricted to vertices in the vertex
intersection $N(G(t_1)) \cap \ldots \cap N(G(t_k))$, the
other vertices having centrality 0. Note that, in general,
there are more vertices in
$N(G(t_1)) \cap \ldots \cap N(G(t_k))$ than in
$N(G(t_1) \cap \ldots \cap G(t_k))$.

The time complexity of this algorithm is proportional to
$\sum_{i=1}^{k} |E(G(t_i))| + m \log m$, where
$m = |E(G(t_1 \vee \ldots \vee t_k))|$. Then, the total time
complexity for this algorithm is
$O(k \times |E(G(t_i))|_{max} + m \log m)$, where
$|E(G(t_i))|_{max}$ is computed for the largest subgraph
$G(t_i)$.

4.4 Single ranking

A simple online faceted ranking consists of a monolithic
ranking, without considering the facet, which is then
filtered to exclude those vertices that are not related to
all tags in the facet. That is, one ranks by the monolithic
global rank of the complete tagged graph and the only
vertices remaining for facet $\{t_1, \ldots, t_k\}$ are the
ones in

\[ N(G(t_1)) \cap \ldots \cap N(G(t_k)). \]

Assuming a precomputed inverted index, mapping tags into
nodes, the complexity of this algorithm is
$O(k \times |N(G(t_i))|_{max})$, where k is the number of
different tags in the facet, and $|N(G(t_i))|_{max}$ is
computed for the biggest subgraph $G(t_i)$. It is also
possible to retrieve a (small) constant number of top
elements to intersect, yielding a time complexity of O(k).
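
The single-ranking filter can be sketched as follows. This is a minimal Python illustration, assuming the global centralities and the inverted index (tag to vertex set) are already available as dictionaries; the names and the numeric values are hypothetical.

```python
def single_ranking(global_rank, tag_vertices, facet):
    """Filter a facet-independent ranking down to a facet.

    global_rank:  {vertex: centrality} computed once on the complete graph
    tag_vertices: {tag: set of vertices of G(tag)} (inverted index)
    facet:        non-empty list of tags
    """
    if not facet:
        raise ValueError("facet must contain at least one tag")
    # Keep only vertices related to every tag in the facet.
    allowed = set(tag_vertices[facet[0]])
    for tag in facet[1:]:
        allowed &= tag_vertices[tag]
    # Order survivors by global centrality, best first (ties by name).
    return sorted(allowed, key=lambda v: (-global_rank.get(v, 0.0), v))


# Hypothetical values for the four users of the music example.
global_rank = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}
tag_vertices = {"blues": {"A", "B", "C", "D"},
                "jazz": {"A", "B", "C"},
                "rock": {"C", "D"}}
```

Note that the order of the survivors never depends on the facet, which is exactly the topic drift issue discussed in the introduction.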

4.5 PR-product

In order to approximate the edge intersection efficiently,
we can precompute individual rankings for each tag and then
combine them by element-wise multiplication. This
approximation is inspired by the probability product of
independent events.

If $G_1$ and $G_2$ are subgraphs of graph $G$ we define the
PageRank ranking product as

\[ \mathcal{R}(G_1) \cdot \mathcal{R}(G_2) := R(C(G_1) \cdot C(G_2)), \]

where $(C(G_1) \cdot C(G_2))(n) := C(G_1)(n) \cdot C(G_2)(n)$
(real product). The PR-product ranking for tagged graph G
according to facet $F = \{t_1, \ldots, t_k\}$ is

\[ \prod_{i=1}^{k} \mathcal{R}(G(t_i)). \]

Assuming the individual rankings for all tags have been
computed, the complexity of this algorithm is
$O(k \times |N(G(t_i))|_{max})$. Here, it is also possible
to reduce the time complexity to O(k) by taking a (small)
constant number of top elements to make the product.
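
A minimal Python sketch of the PR-product merge, assuming the per-tag PageRank values are already stored as dictionaries. The function name is illustrative, and treating a vertex absent from some tag's subgraph as having centrality 0 (and therefore dropping it) is an assumption of this sketch. The sample values are those of the a, b, c example discussed in Section 4.6.

```python
import math


def pr_product(tag_centralities, facet):
    """Rank vertices by the product of their per-tag centralities.

    tag_centralities: {tag: {vertex: PageRank in G(tag)}}
    facet:            non-empty list of tags
    """
    per_tag = [tag_centralities[t] for t in facet]
    # A vertex missing from some G(t) would get product 0, so only
    # vertices present in every per-tag ranking are kept.
    common = set(per_tag[0]).intersection(*per_tag[1:])
    score = {v: math.prod(c[v] for c in per_tag) for v in common}
    return sorted(common, key=lambda v: (-score[v], v))


centralities = {
    "blues": {"a": 0.75, "b": 0.1, "c": 0.01},
    "jazz": {"a": 0.04, "b": 0.1, "c": 0.05},
}
```

With these values the products are approximately 0.03, 0.01 and 0.0005, so the resulting order is a, b, c.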
4.6 R-sum

Consider a recommendation graph G larger than that in
Figure 1 and the query blues ∧ jazz. Assume that the
PageRank of the top three nodes in the rankings
corresponding to the subgraphs G(blues) and G(jazz) are as
given in Table 3. Ignoring other nodes, the ranking given
by the PR-product rule is a, b and c. However, it may be
argued that node b shows a better equilibrium of PageRank
values than node a. Intuitively, one may feel inclined to
rank b over a given the values in the table. In order to
follow this intuition, we devised the R-sum algorithm,
which is also intended to avoid topic drift inside the
queried facet, that is, any tag prevailing over the others.

The R-sum ranking for a tagged graph G according to facet
$F = \{t_1, \ldots, t_k\}$ is

\[ \sum_{i=1}^{k} \mathcal{R}(G(t_i)), \]

where we define the PageRank ranking sum as

\[ \mathcal{R}(G_1) + \mathcal{R}(G_2) := R(-(\mathcal{R}(G_1) + \mathcal{R}(G_2))). \]

Notice that in this sum we are using as centrality the sum
of ranking positions in reverse order, and according to the
R-sum algorithm, the ranking of nodes in the example of
Table 3 is b, a and c.

The complexity of this algorithm is similar to that of
PR-product.
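
The R-sum merge can be sketched in Python in the same style. The helper names are illustrative, positions are computed only among the vertices listed for each tag, and restricting the result to vertices ranked under every tag is an assumption of this sketch. The sample values are the per-tag PageRank values of nodes a, b and c from the example above.

```python
def positions(centrality):
    """Map each vertex to its 1-based ranking position (1 = best)."""
    ordered = sorted(centrality, key=lambda v: (-centrality[v], v))
    return {v: i + 1 for i, v in enumerate(ordered)}


def r_sum(tag_centralities, facet):
    """Rank vertices by the sum of their per-tag ranking positions."""
    per_tag = [positions(tag_centralities[t]) for t in facet]
    common = set(per_tag[0]).intersection(*per_tag[1:])
    total = {v: sum(p[v] for p in per_tag) for v in common}
    # A smaller position sum means a better (more balanced) vertex.
    return sorted(common, key=lambda v: (total[v], v))


centralities = {
    "blues": {"a": 0.75, "b": 0.1, "c": 0.01},
    "jazz": {"a": 0.04, "b": 0.1, "c": 0.05},
}
```

Here b is second for blues and first for jazz (sum 3), while a is first and third (sum 4), so b overtakes a even though its PageRank product is smaller.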



Node   C(G(blues))   C(G(jazz))   PR-prod.   R-sum
a      0.75          0.04         0.03       4
b      0.1           0.1          0.01       3
c      0.01          0.05         0.0005     5

Table 3: Comparison of PR-product and R-sum in
an example.



4.7 T-N-intersection

In this case, edge intersection is computed involving only
vertices (and associated edges) that are on the top w
positions of the individual rankings. The T-N-intersection
ranking for tagged graph G according to facet
$F = \{t_1, \ldots, t_k\}$ is

\[ \mathcal{R}\Big(\bigcap_{i=1}^{k} G(\mathcal{T}(\mathcal{R}(G(t_i)), w))(t_i)\Big), \]

where $\mathcal{T}(\mathcal{R}(G), w)$ is the set of
vertices including the top w vertices ranked using PageRank
and $G(\{a, b, \ldots\})$ is the maximal subgraph of G
including vertices $\{a, b, \ldots\}$ and edges connecting
them. In other words, this algorithm has the following
steps: (i) for each $t_i$, the subgraph $G(t_i)$ is
constructed; (ii) a ranking of users is computed on the
basis of the PageRank of $G(t_i)$; (iii) the w winners of
each $t_i$-associated ranking are extracted; (iv) given a
facet $F = \{t_1, \ldots, t_k\}$, a new subgraph including
only the winners for each tag $t_i$ is constructed; (v) a
facet-associated ranking is constructed based on the new
graph. Steps (i)-(iii) are computed offline. In this
presentation, we have fixed the number of top items
selected at five hundred (w = 500).

Assuming the individual rankings for k tags have been
computed, the complexity of this algorithm is O(k).
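
Steps (iii) and (iv) can be sketched as follows. This Python fragment is one possible reading of the construction, not the authors' implementation: it assumes an edge survives only if it carries every tag of the facet and both endpoints are among the winners for every tag. The function names and the toy data are hypothetical, and the final PageRank of step (v) would then be run on the returned edges.

```python
def top_w(centrality, w):
    """Step (iii): the w best vertices of one per-tag ranking."""
    ordered = sorted(centrality, key=lambda v: (-centrality[v], v))
    return set(ordered[:w])


def winners_subgraph(graph, winners, facet):
    """Step (iv): edges kept for the facet-associated ranking.

    graph:   {(source, target): set_of_tags}
    winners: {tag: set of top-w vertices for that tag} (computed offline)
    """
    kept = {}
    for (src, dst), tags in graph.items():
        if all(t in tags and src in winners[t] and dst in winners[t]
               for t in facet):
            kept[(src, dst)] = tags
    return kept


toy = {("A", "B"): {"blues", "jazz"}, ("B", "C"): {"jazz"},
       ("B", "D"): {"blues"}, ("A", "C"): {"blues", "jazz"},
       ("C", "D"): {"rock"}}
```

For clarity the sketch scans the whole edge dictionary; a deployment would instead store, per tag, only the edges among that tag's w winners, so that the online work depends on k and w rather than on the size of the network.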

4.8 Scalability Analysis 

As noticed by Langville and Meyer [LM03], the number
of iterations of PageRank is fixed when both the
tolerated error and other parameters are fixed, yield- 
ing one hundred for e = 10~^ (see Section [2]). As 
each iteration consists of a sparse adjacency matrix
multiplication, the time complexity of PageRank is linear
in the number of edges of the graph. In our case, given a
tagged graph $G_0 = (N_0, E_0, T_0)$, for each tag there is
a corresponding subgraph with a known size. Then the total
temporal and spatial complexity of the faceted PageRank for
all individual tags is

\[ O\Big(\sum_{t} |E(G_0(t))|\Big) = O\Big(\sum_{e \in E_0} |T_0(e)|\Big), \]

where $t$ ranges over $\bigcup_{e \in E_0} T_0(e)$, the
complete set of tags.

Therefore, if the average number of tags per edge is
constant or grows very slowly as the graph grows, then the
algorithms in Sections 4.5, 4.6 and 4.7 are scalable,
linear in the number of edges of the complete tagged graph.
This can be verified empirically in Figure 7, which shows
that the distribution of tags per edge falls quickly,
having a mean of 9.26 tags per edge for the YouTube tagged
graph and 13.37 for the Flickr tagged graph. These are not
heavy-tailed distributions and, since tags are manually
added to each uploaded content, we do not expect the
average number of tags per recommendation to increase
significantly with network growth.
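
The linear bound rests on a simple counting identity: every edge is counted once for each of its tags, so the per-tag subgraph sizes add up to the number of (edge, tag) pairs. A small Python check on a hypothetical toy graph (edge to tag set, the same representation assumed in earlier sketches):

```python
from collections import Counter


def per_tag_edge_counts(graph):
    """|E(G(t))| for every tag t of a {edge: set_of_tags} graph."""
    counts = Counter()
    for tags in graph.values():
        counts.update(tags)
    return counts


def mean_tags_per_edge(graph):
    """Average number of tags carried by an edge."""
    return sum(len(tags) for tags in graph.values()) / len(graph)


toy = {("A", "B"): {"blues", "jazz"}, ("B", "C"): {"jazz"},
       ("B", "D"): {"blues"}, ("A", "C"): {"blues", "jazz"},
       ("C", "D"): {"rock"}}

counts = per_tag_edge_counts(toy)
total_work = sum(counts.values())  # sum over tags of |E(G(t))|
```

Here `total_work` equals the mean number of tags per edge times the number of edges, which is the quantity that must stay proportional to the size of the complete graph for the offline step to scale.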
[Log-log plot; x-axis: #tags.]

Figure 7: The distribution of number of tags per
edge.

In our experiments the computation of all the faceted
singleton tag rankings (104,927 tags) for the video network
sample took 211.4 times more time than the single ranking
for the complete tagged graph. Meanwhile the photo network
sample (283,093 tags) took 1744.9 times more time.

Our merging algorithms work in real-time because they use
only the top w results, where w is a small fixed number
like 500 or 1000. Choosing an appropriate w for an
application^6 will enable it to store only the w top
elements of each single-tag facet.

5 Experimental results 

In this section, we compare the behavior of the
algorithms presented in Section 4. As a basis of
comparison we use two algorithms whose online
computation is infeasible, but which are intuitively
reasonable: E-intersection (Section 4.2)
and E-union/N-intersection (Section 4.3). In
order to quantify the "distance" between the results
given by two different algorithms, we use
two ranking similarity measures, OSim [Hav02]
and KSim [Ken38, Hav02]. The first measure,
OSim($r_1$, $r_2$), indicates the degree of overlap between
the top n elements of rankings $r_1$ and $r_2$. We define
the overlap of two sets A and B (each of size n) to
be $|A \cap B|/n$. The second measure is

\[
KSim(r_1, r_2) = \frac{\left|\{(u,v) : r_1' \text{ and } r_2' \text{ have the same ordering of } (u,v),\ u \neq v\}\right|}{|U|\,(|U|-1)}
\]

where U is the union of all elements in rankings
$r_1$ and $r_2$, $r_1'$ is $r_1$ extended with the elements of U
not in $r_1$, and $r_2$ is extended analogously to obtain $r_2'$.
This measure is a variant of Kendall's distance that considers the
relative orderings, i.e., counts how many inversions
are in a determined top set. In both cases, values
closer to 0 mean that the results are not similar and
values closer to 1 mean the opposite.

* How to choose a good w is beyond the scope of this paper.
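The two measures can be sketched in a few lines. This is an illustrative reading of the definitions above, not the authors' code. One point is an assumption: the text does not say how the appended elements are ordered among themselves, so the sketch treats them as tied in last place, and a pair that is tied in one ranking counts as a disagreement.

```python
from itertools import combinations


def osim(r1, r2, n):
    """Overlap |A ∩ B| / n between the top-n elements of two rankings."""
    return len(set(r1[:n]) & set(r2[:n])) / n


def ksim(r1, r2):
    """Fraction of pairs from U = r1 ∪ r2 that both extended rankings order alike.

    Assumption: elements missing from a ranking are appended as a tie in
    last place; a pair tied in one ranking is counted as a disagreement.
    """
    universe = set(r1) | set(r2)
    if len(universe) < 2:
        return 1.0  # no pair to disagree on

    def positions(ranking):
        pos = {u: i for i, u in enumerate(ranking)}
        return {u: pos.get(u, len(ranking)) for u in universe}

    def sign(x):
        return (x > 0) - (x < 0)

    p1, p2 = positions(r1), positions(r2)
    agree = sum(
        1
        for u, v in combinations(universe, 2)
        if sign(p1[u] - p1[v]) == sign(p2[u] - p2[v])
    )
    # Counting unordered pairs gives the same ratio as the ordered-pair
    # formula, since numerator and denominator are both halved.
    total = len(universe) * (len(universe) - 1) // 2
    return agree / total
```

In the experiments both measures are applied to top-n prefixes, so a caller would pass r1[:n] and r2[:n] to ksim.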

5.1 Favorite videos network 

Samples include all facets of tag pairs {t_j, t_k} extracted
from the 99 most used tags of the network**.
That is, 4851 tag pairs were compared and their similarities
averaged. For each tag pair, the proposed
merging algorithms (PR-product, R-sum and τ-N-intersection)
were compared with the reference algorithms
(E-intersection and E-union/N-intersection),
using OSim and KSim to measure the similarity of
the rankings. Some of the tags are: music, funny, comedy,
live, guitar, rock, super, dance, animation, parody,
song, mario, game, new, tv, pop, john, love, world.
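The figure of 4851 is the number of unordered pairs among 99 tags, C(99, 2) = 99 · 98 / 2. A short sketch of the enumeration and averaging follows. The function average_similarity, the placeholder tag names and the constant similarity are all hypothetical stand-ins for the real per-facet comparison.

```python
from itertools import combinations
from math import comb


def average_similarity(tags, similarity):
    """Average a per-facet similarity over all unordered tag pairs."""
    pairs = list(combinations(tags, 2))
    return sum(similarity(tj, tk) for tj, tk in pairs) / len(pairs)


tags = [f"tag{i}" for i in range(99)]  # placeholder names, not the real tags
num_pairs = comb(len(tags), 2)         # number of two-tag facets

# A constant similarity stands in for OSim or KSim on a real tag pair.
constant = average_similarity(tags, lambda tj, tk: 0.5)
```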

Table 4 presents a summary of the comparisons for
the favorite videos network, where we display averaged
similarities for different top sizes of ranked users.
Figures 8 and 9 show a more detailed summary of
results for the OSim metric (because it discriminates
different situations better than KSim). The x-axis
corresponds to the number of vertices resulting from
the basis-of-comparison algorithm (E-intersection or
E-union/N-intersection) and the y-axis to the top
number n of vertices used to compute the similarities.
The similarity results (between 0 and 1) falling
in each of the log-log ranges were averaged. Observe
that darker tones correspond to values closer to 1,
i.e., more similar results. White spaces correspond
to cases for which there are no data, e.g., whenever
the y coordinate is greater than the intersection size.

** Some tags like you, video or youtube, which give no
information, were removed from the experiment.
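The averaging over log-log ranges can be sketched as follows. The bin edges used for the figures are not given in the text, so the powers of two used here are an assumption, as are the function names and the sample layout (intersection size, top n, similarity).

```python
from collections import defaultdict


def log2_bin(x):
    """Index of the power-of-two range containing the positive integer x."""
    return x.bit_length() - 1  # exact floor(log2(x)) for integers


def average_by_log_bins(samples):
    """Average similarities per (size bin, top-n bin) cell.

    samples: iterable of (intersection_size, n, similarity) triples.
    Triples with n greater than the intersection size are skipped, which
    leaves those cells empty (the white areas described in the text).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for size, n, sim in samples:
        if n > size:
            continue
        cell = (log2_bin(size), log2_bin(n))
        sums[cell] += sim
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


# Toy data, invented for the example.
cells = average_by_log_bins([(16, 8, 0.5), (20, 9, 0.7), (100, 8, 0.2), (4, 8, 0.9)])
```

Each key of the returned dictionary identifies one cell of the heat map and each value gives the tone to plot for that cell.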

5.2 Favorite photos network 

Experiments with Flickr were similar: the top 99 tags
were paired to form 4851 tag pairs. A small sample of the
top 99 tags is: bw, portrait, nature, bravo, sky, blue,
water, soe, flower, light, clouds, sunset, red, film,
macro, white, landscape, green, girl, blackandwhite.

Table 5 as well as Figures 10 and 11 summarize the
results.



Average similarity to E-intersection (each cell: OSim | KSim)

Algorithm          top 8         top 16        top 32
Single             0.07 | 0.48   0.09 | 0.49   0.11 | 0.50
PR-product         0.44 | 0.59   0.43 | 0.60   0.42 | 0.60
R-sum              0.52 | 0.62   0.52 | 0.63   0.52 | 0.64
τ-N-intersection   0.28 | 0.51   0.34 | 0.54   0.39 | 0.56

Average similarity to E-union/N-intersection (each cell: OSim | KSim)

Algorithm          top 8         top 16        top 32
Single             0.17 | 0.50   0.21 | 0.51   0.27 | 0.53
PR-product         0.50 | 0.57   0.59 | 0.62   0.67 | 0.66
R-sum              0.28 | 0.52   0.32 | 0.54   0.38 | 0.56
τ-N-intersection   0.19 | 0.50   0.22 | 0.52   0.26 | 0.53

Table 5: Photos network: Comparison of ranking algorithms



Average similarity to E-intersection (each cell: OSim | KSim)

Algorithm          top 8         top 16        top 32
Single             0.08 | 0.48   0.10 | 0.50   0.13 | 0.51
PR-product         0.36 | 0.56   0.37 | 0.58   0.39 | 0.59
R-sum              0.53 | 0.63   0.53 | 0.64   0.52 | 0.66
τ-N-intersection   0.15 | 0.49   0.15 | 0.51   0.10 | 0.51

Average similarity to E-union/N-intersection (each cell: OSim | KSim)

Algorithm          top 8         top 16        top 32
Single             0.31 | 0.53   0.34 | 0.55   0.39 | 0.56
PR-product         0.72 | 0.70   0.78 | 0.74   0.83 | 0.79
R-sum              0.35 | 0.54   0.42 | 0.56   0.50 | 0.59
τ-N-intersection   0.13 | 0.49   0.12 | 0.51   0.09 | 0.51

Table 4: Videos network: Comparison of ranking algorithms



5.3 Discussion 

As can be seen from Tables 4 and 5 and Figures
8 to 11, the Single Ranking algorithm gave the worst
results in most cases.



Since the τ-N-intersection algorithm is based on
retaining only the 500 top-ranked users for each tag,
it is natural to observe a worse OSim measure than
for the other algorithms, especially for intersections
larger than 500 nodes. However, this algorithm gives
worse results even for smaller intersections. This fact
is explained by the relevance of a large number of
recommendations from low-ranked users when computing
the PageRank in both the E-intersection and the
E-union/N-intersection cases. Also note that the
τ-N-intersection approach gave better results on the
photo network than on the video network. A possible
cause is the assortativeness of the photo network (see
Figure 4 and Section 3.3). Indeed, since assortativeness
implies that users with many recommendations are
preferentially recommended by users who also have many
recommendations, the relevance of low-ranked users
in the computation of the centrality measure is lower.

There is a remarkable improvement using the
R-sum algorithm compared to the other merging
algorithms when considering the similarity to the
E-intersection standard on both networks. Also,
the best merging algorithm for similarity to
the second standard, E-union/N-intersection, is the
PR-product merging algorithm.

6 Summary

We have proposed different algorithms for merging
faceted rankings of users in collaborative tagging
systems, which gave results comparable to those of two
reasonable standards. We have also analyzed the
scalability of this approach.

A prototype application called Egg-O-Matic is
available online [EOM]. It includes the R-sum ranking
merging to approximate the E-intersection ranking, in a
mode called "all tags, same content", and the
PR-product ranking merging to approximate the
E-union/N-intersection ranking, in a mode
called "all tags, any content".

Another step that can be taken to reduce tag
dimensionality is clustering to agglomerate tags.
This work also opens the path for a more complex
comparison of reputations, for example by integrating
the best positions of a user even if the tags involved
are not related (disjunctive queries), in order
to summarize the relevance of a user generating content
on the web. It is also possible to extend the
algorithms in Section 3 to merge rankings generated
from different systems (cross-system ranking),
looking to obtain a ranking of users using multiple
collaborative tagging systems.

[Figure 8 panels: E-intersection vs. Single; E-intersection vs. PR-product. x-axis: Intersection Size.]

Figure 8: Videos network: Average similarity (OSim) to E-intersection

[Figure 9 panels: E-union/N-intersection vs. Single; E-union/N-intersection vs. PR-product. x-axis: Intersection Size.]

Figure 9: Videos network: Average similarity (OSim) to E-union/N-intersection

[Figure 10 panel: E-intersection vs. PR-product.]

Figure 10: Photos network: Average similarity to E-intersection

[Figure 11 panels: E-union/N-intersection vs. Single; vs. PR-product; vs. R-sum; vs. τ-N-intersection.]

Figure 11: Photos network: Average similarity to E-union/N-intersection